# Install all libraries
! pip install pytorch-lightning wandb rdkit ogb deepchem torch
# Download all data
! mkdir data/
! wget https://raw.githubusercontent.com/schwallergroup/ai4chem_course/main/notebooks/02%20-%20Supervised%20Learning/data/esol.csv -O data/esol.csv
! wget https://raw.githubusercontent.com/schwallergroup/ai4chem_course/main/notebooks/03%20-%20Intro%20to%20Deep%20Learning/esol_utils.py -O esol_utils.py
Week 3 tutorial 1 - AI 4 Chemistry
Table of contents
- Supervised deep learning.
- Neural Networks.
- Creating a deep learning model.
0. Relevant packages
PyTorch
Based on the Torch library, PyTorch is one of the most popular deep learning frameworks for machine learning practitioners. We will learn to use PyTorch to do deep learning work. You can also browse the PyTorch tutorials and docs for additional details.
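If PyTorch is new to you, its two core ideas are tensors (NumPy-like arrays with GPU support) and autograd (automatic differentiation). A minimal sketch of both:
import torch

# A tensor that tracks gradients through computations
a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
b = (a ** 2).sum()  # a simple differentiable computation
b.backward()        # autograd computes db/da = 2a
print(a.grad)       # tensor([2., 4., 6.])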
PyTorch Lightning
PyTorch Lightning is the deep learning framework for professional AI researchers and machine learning engineers who need maximal flexibility without sacrificing performance at scale. You can also browse its documentation for additional details.
Weights & Biases (W&B)
Weights & Biases is the machine learning platform for developers to build better models faster. Use W&B’s lightweight, interoperable tools to quickly track experiments, version and iterate on datasets, evaluate model performance, reproduce models, visualize results and spot regressions, and share findings with colleagues. You can also browse its documentation for additional details.
Exercise: Create a W&B account.
Go to W&B and create an account. We will be using this platform to track our experiments!
Set a random seed to ensure repeatability of experiments
import random
import numpy as np
import torch
# Random Seeds and Reproducibility
random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed(0)
np.random.seed(0)
1. Supervised Deep Learning
From last session we should already be familiar with supervised learning: it is a type of machine learning in which a model is trained on a labeled dataset to learn the relationships between input and output data.
The models we have seen so far are fairly simple and work well in some scenarios, but sometimes they are not enough. What can we do in those cases?
Deep Learning
Deep learning is a subset of machine learning that involves training artificial neural networks to learn from data. Unlike traditional machine learning algorithms, which often rely on hand-crafted features and linear models, deep learning algorithms can automatically learn features and hierarchies of representations from raw data. This allows deep learning models to achieve state-of-the-art performance on a wide range of tasks in chemistry, like molecular property prediction, reaction prediction and retrosynthesis, among others.
Data: Let’s go back to the ESOL dataset from last week.
We will use this so we can compare our results with the previous models. We’ll reuse last week’s code for data loading and preprocessing.
from esol_utils import load_esol_data
(X_train, X_valid, X_test, y_train, y_valid, y_test, scaler) = load_esol_data()
2. Neural Networks
Neural Networks are a type of machine learning model that is designed to simulate the behavior of the human brain.
They consist of layers of interconnected nodes, and each node applies a linear function
to its inputs. Non-linear activation functions are used to introduce non-linearity
into the model, allowing it to learn more complex patterns in the data.
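To make this concrete, here is a minimal sketch of a single layer in PyTorch (the sizes are arbitrary): a linear map followed by a ReLU activation.
import torch
from torch import nn

layer = nn.Sequential(nn.Linear(3, 2), nn.ReLU())  # 3 inputs -> 2 outputs
x = torch.randn(1, 3)  # one example with 3 features
print(layer(x))        # ReLU(x @ W.T + b): negative values are clipped to 0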
import os
import torch
import wandb
from torch import nn
import torch.nn.functional as F
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from pytorch_lightning.loggers import WandbLogger
3. Creating a deep learning model.
Creating DL models is fairly easy nowadays, especially thanks to libraries like PyTorch Lightning. They do most of the work for you, but they still allow you to have a lot of control over your models.
To use PyTorch Lightning, we first need to know about classes.
Think of a class as a template or a set of instructions for creating objects with specific properties and behaviors. These objects are called instances of the class.
For example, let’s say you want to make a program to represent dogs.
class Dog:
    def __init__(self, name, color):
        self.name = name
        self.color = color

    def say_your_name(self):
        print(f"My name is {self.name}")
In this example, a dog has two attributes: `name` and `color`. It also has a method: `say_your_name`.
Now we can create as many dogs as we want! For example:
lassie = Dog(name="Lassie", color="White")
pluto = Dog(name="Pluto", color="Yellow")
And we can access their methods as follows:
pluto.say_your_name()  # Output: "My name is Pluto"
Now let’s define a NeuralNetwork class.
What is each part?
- `__init__` is where we specify the model architecture. There are loads of layers (model parts) you can use, and it’s all defined here.
- `training_step` is one of our model’s methods. It updates the model parameters using an optimizer.
- `configure_optimizers`, well, configures the optimizers 😅. Here we define which optimizer to use, including the learning rate.
- `forward` specifies what the model should do when an input is given.
class NeuralNetwork(pl.LightningModule):
    def __init__(self, input_sz, hidden_sz, train_data, valid_data, test_data, batch_size=254, lr=1e-3):
        super().__init__()
        self.lr = lr
        self.train_data = train_data
        self.valid_data = valid_data
        self.test_data = test_data
        self.batch_size = batch_size

        # Define all the components
        self.model = nn.Sequential(
            nn.Linear(input_sz, hidden_sz),
            nn.ReLU(),
            nn.Linear(hidden_sz, hidden_sz),
            nn.ReLU(),
            nn.Linear(hidden_sz, 1)
        )

    def training_step(self, batch, batch_idx):
        # Here we define the train loop.
        x, y = batch
        z = self.model(x)
        loss = F.mse_loss(z, y)
        self.log("Train loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        # Define validation step. At the end of every epoch, this will be executed
        x, y = batch
        z = self.model(x)
        loss = F.mse_loss(z, y)  # report MSE
        self.log("Valid MSE", loss)

    def test_step(self, batch, batch_idx):
        # What to do in test
        x, y = batch
        z = self.model(x)
        loss = F.mse_loss(z, y)  # report MSE
        self.log("Test MSE", loss)

    def configure_optimizers(self):
        # Here we configure the optimization algorithm.
        optimizer = torch.optim.Adam(
            self.parameters(),
            lr=self.lr
        )
        return optimizer

    def forward(self, x):
        # Here we define what the NN does with its parts
        return self.model(x).flatten()

    def train_dataloader(self):
        return DataLoader(self.train_data, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.valid_data, batch_size=self.batch_size, shuffle=False)

    def test_dataloader(self):
        return DataLoader(self.test_data, batch_size=self.batch_size, shuffle=False)
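Before training, it can help to sanity-check the forward pass on random data. A quick sketch (the sizes here are arbitrary, and we can pass None for the datasets since the dataloaders are never called):
check_model = NeuralNetwork(input_sz=16, hidden_sz=8,
                            train_data=None, valid_data=None, test_data=None)
x = torch.randn(4, 16)       # a fake batch: 4 samples, 16 features each
print(check_model(x).shape)  # torch.Size([4]): one prediction per sample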
Dataset class
To use Lightning, we also need to create a `Dataset` class.
It looks more complicated, but it actually allows a lot of flexibility in more complex scenarios! (so don’t be daunted by this 😉)
from torch.utils.data import Dataset
class ESOLDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        X_ = torch.as_tensor(self.X[idx].astype(np.float32))
        y_ = torch.as_tensor(self.y[idx].astype(np.float32).reshape(-1))
        return X_, y_
train_data = ESOLDataset(X_train, y_train)
valid_data = ESOLDataset(X_valid, y_valid)
test_data = ESOLDataset(X_test, y_test)
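A quick check of how the class behaves: it can be indexed like a list, and each item is a (features, label) pair of tensors.
X_0, y_0 = train_data[0]  # __getitem__ returns one (features, label) pair
print(len(train_data), X_0.shape, y_0.shape)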
# This will ask you to login to your wandb account
="nn-solubility",
wandb.init(project={
config"batch_size": 32,
"learning_rate": 0.001,
"hidden_size": 512,
"max_epochs": 100
})
# Here we create an instance of our neural network.
# Play around with the hyperparameters!
nn_model = NeuralNetwork(
    input_sz=X_train.shape[1],
    hidden_sz=wandb.config["hidden_size"],
    train_data=train_data,
    valid_data=valid_data,
    test_data=test_data,
    lr=wandb.config["learning_rate"],
    batch_size=wandb.config["batch_size"]
)
# Define trainer: How we want to train the model
wandb_logger = WandbLogger()
trainer = pl.Trainer(
    max_epochs=wandb.config["max_epochs"],
    logger=wandb_logger
)
# Finally! Training a model :)
trainer.fit(model=nn_model)
# Now run test
results = trainer.test(ckpt_path="best")
wandb.finish()
# Test RMSE
test_mse = results[0]["Test MSE"]
test_rmse = test_mse ** 0.5
print(f"\nANN model performance: RMSE on test set = {test_rmse:.4f}\n")
Exercise:
Play with the hyperparameters, see what you get.
You may play around with `hidden_sz`, `batch_size`, `max_epochs`, `lr`, or even modify the architecture of our neural network, i.e. change the number of layers, the activation function, etc. One possible variation is sketched below.
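For instance, here is one alternative you could assign to `self.model` in `__init__`. This is only an illustration; the depth, width, and activation (Tanh instead of ReLU) are arbitrary choices, not a recommendation:
# A deeper variant with a different activation (illustrative sketch)
self.model = nn.Sequential(
    nn.Linear(input_sz, hidden_sz),
    nn.Tanh(),
    nn.Linear(hidden_sz, hidden_sz),
    nn.Tanh(),
    nn.Linear(hidden_sz, hidden_sz),
    nn.Tanh(),
    nn.Linear(hidden_sz, 1)
)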